
feat: voice-activity streaming mode & inner-vad for speech-to-text module#1160

Merged
IgorSwat merged 28 commits into
mainfrom
@is/vad-streaming
May 22, 2026

Conversation

@IgorSwat

@IgorSwat IgorSwat commented May 20, 2026

Contributor

Description

This PR introduces changes focused on the voice-activity-detection (VAD) module and its use within the library:

  • Native-side VAD streaming - introduces a continuous voice-activity-detection mechanism with a user-friendly callback system. Example usage from the demo app:
  await model.stream({
    onSpeechBegin: () => {...},
    onSpeechEnd: () => {...},
    options: {...},
  });
  • VAD x STT integration - adds an option to utilize voice-activity-detection within the speech-to-text module, significantly improving the effective performance of the STT.
  • Demo apps: introduces a new screen in the speech demo app, VoiceActivityDetectionScreen, and changes the behavior of SpeechToTextScreen by adding a toggle that switches the VAD submodule for STT on or off.
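As a rough illustration of the callback system from the first bullet, the sketch below turns a sequence of per-frame speech decisions into `onSpeechBegin` / `onSpeechEnd` calls. Only the two callback names come from the example above; `emitSpeechEvents`, `SpeechCallbacks`, and the boolean frame input are hypothetical stand-ins, and the real detection runs natively inside the library.

```typescript
// Callback names mirror the PR example; everything else here is hypothetical.
interface SpeechCallbacks {
  onSpeechBegin: () => void;
  onSpeechEnd: () => void;
}

// Fires a callback on every silence->speech and speech->silence transition.
function emitSpeechEvents(
  frames: readonly boolean[],
  callbacks: SpeechCallbacks
): void {
  let speaking = false;
  for (const isSpeech of frames) {
    if (isSpeech && !speaking) callbacks.onSpeechBegin();
    if (!isSpeech && speaking) callbacks.onSpeechEnd();
    speaking = isSpeech;
  }
  // Close a segment that is still open when the input ends.
  if (speaking) callbacks.onSpeechEnd();
}
```

Each begin call is always paired with an end call, including when the input stops in the middle of speech.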

Introduces a breaking change?

  • Yes
  • No

Type of change

  • Bug fix (change which fixes an issue)
  • New feature (change which adds functionality)
  • Documentation update (improves or adds clarity to existing documentation)
  • Other (chores, tests, code style improvements etc.)

Tested on

  • iOS
  • Android

Testing instructions

  • To test VAD streaming: run the VoiceActivityDetectionScreen within the Speech demo app.
  • To test the VAD & STT integration: run the SpeechToTextScreen within the Speech demo app, with the VAD toggle on.

Screenshots

Related issues

#1118

Checklist

  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have updated the documentation accordingly
  • My changes generate no new warnings

Additional notes

@IgorSwat IgorSwat requested review from chmjkb and msluszniak May 20, 2026 13:09
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 694fe4f to 1c2411e Compare May 20, 2026 13:15
@IgorSwat IgorSwat changed the base branch from main to @is/speech-to-text-ultimate May 20, 2026 13:26
@IgorSwat IgorSwat force-pushed the @is/speech-to-text-ultimate branch from 02113ff to 6bba141 Compare May 20, 2026 15:46
Comment thread apps/speech/screens/SpeechToTextScreen.tsx
Comment thread apps/speech/screens/VoiceActivityDetectionScreen.tsx
Base automatically changed from @is/speech-to-text-ultimate to main May 21, 2026 08:20
@IgorSwat IgorSwat force-pushed the @is/vad-streaming branch from 1c2411e to 0ea858d Compare May 21, 2026 08:55
@msluszniak msluszniak added the feature PRs that implement a new feature label May 21, 2026
@IgorSwat IgorSwat requested a review from benITo47 May 21, 2026 12:49
@msluszniak

This comment was marked as resolved.

Comment thread docs/docs/03-hooks/01-natural-language-processing/useSpeechToText.md Outdated
Comment thread docs/docs/04-typescript-api/01-natural-language-processing/VADModule.md Outdated
}
})();

while (this.isStreaming && !finished) {

Member

stream() resolves as soon as this.isStreaming flips, but the native loop only re-checks the flag at the top of the next iteration — so for up to timeout + one inference after await streamStop() returns, the native streamer is still alive, can still queue callInvoker_->invokeAsync callbacks, and still touches audioBuffer_. If the caller then runs unload() (or the host object is destroyed) we're in UAF / use-after-unload territory.

Two options: (a) actually join — stream() doesn't resolve until the native stream() call returns, and streamStop() awaits that; or (b) document explicitly that unload() is not safe immediately after streamStop() and that callbacks may fire after the promise resolves. (a) is the safer contract.
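Option (a) could look roughly like the sketch below: `stream()` returns the promise of the underlying loop, and `streamStop()` awaits that same promise, so nothing is still running once it resolves. `JoinedStreamer` and `NativeLoop` are hypothetical names that stand in for the native call; this is not the code in the PR.

```typescript
// Hypothetical stand-in for the native streaming call; it polls shouldStop().
type NativeLoop = (shouldStop: () => boolean) => Promise<void>;

class JoinedStreamer {
  private isStreaming = false;
  private running: Promise<void> | null = null;
  private readonly nativeLoop: NativeLoop;

  constructor(nativeLoop: NativeLoop) {
    this.nativeLoop = nativeLoop;
  }

  // Resolves only when the underlying loop has actually returned.
  stream(): Promise<void> {
    if (this.isStreaming) {
      throw new Error("StreamingInProgress");
    }
    this.isStreaming = true;
    this.running = this.nativeLoop(() => !this.isStreaming).finally(() => {
      this.isStreaming = false;
      this.running = null;
    });
    return this.running;
  }

  // Requests a stop, then joins: waits until the loop has exited.
  async streamStop(): Promise<void> {
    const running = this.running;
    this.isStreaming = false;
    if (running !== null) {
      // Errors are reported to the stream() caller, not to the stopper.
      await running.catch(() => undefined);
    }
  }
}
```

With this contract, a caller that awaits `streamStop()` can unload afterwards without racing a loop that is still finishing its last iteration.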

@msluszniak msluszniak linked an issue May 21, 2026 that may be closed by this pull request

@chmjkb chmjkb left a comment

Collaborator

besides my previous comment, I think it looks good, great work!

@barhanc barhanc mentioned this pull request May 22, 2026
12 tasks

@msluszniak msluszniak left a comment

Member

Review from local verification (VAD native tests pass 15/15, demo app boots). A few correctness items and minor cleanups inline.

msluszniak

This comment was marked as resolved.

msluszniak added 15 commits May 22, 2026 14:09
The size check on audioBuffer_ raced with streamInsert writes under
audioBufferMutex_. Move both the size comparison and the erase under
a single lock so the read isn't concurrent with vector mutation.

generate() ran unlocked against a std::span pointing into audioBuffer_,
relying on the vector's reservation never being exceeded. Unbounded
streamInsert from JS could grow the buffer past capacity, trigger
reallocation, and invalidate the span. Take a local copy under the
lock instead so the inference operates on stable data.

Previously `lastMerged.end = current.end` would shrink the merged
segment if a non-monotonic input arrived (current.end < lastMerged.end).
postprocess() doesn't produce such input today, but the safer form
removes the hidden invariant.
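The safer form can be sketched as follows. The native code is C++; this is a TypeScript transliteration under assumptions, where `Segment`, `maxGap`, and the `<=` comparison are illustrative choices rather than the library's definitions.

```typescript
// Hypothetical segment shape; the units are whatever the caller uses.
interface Segment {
  start: number;
  end: number;
}

// Merges neighbouring segments whose gap is at most maxGap.
function mergeSegments(
  segments: readonly Segment[],
  maxGap: number
): Segment[] {
  const merged: Segment[] = [];
  for (const current of segments) {
    const last = merged[merged.length - 1];
    if (last !== undefined && current.start - last.end <= maxGap) {
      // Math.max keeps the merged segment from shrinking when
      // current.end < last.end (non-monotonic input).
      last.end = Math.max(last.end, current.end);
    } else {
      merged.push({ ...current });
    }
  }
  return merged;
}
```

A plain assignment `last.end = current.end` would give the same result for monotonic input, which is why the bug stays hidden until an inner, shorter segment arrives.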
isStreaming stayed true after the native stream() resolved (whether
normally or via error), so subsequent code relying on the flag saw
stale state. Reset it in a finally block alongside the wake/finished
bookkeeping.
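A minimal sketch of that pattern, assuming a hypothetical `nativeStream` function in place of the real native call:

```typescript
class FlagResettingStreamer {
  isStreaming = false;

  // nativeStream is a hypothetical stand-in for the native stream() call.
  async stream(nativeStream: () => Promise<void>): Promise<void> {
    this.isStreaming = true;
    try {
      await nativeStream();
    } finally {
      // Runs on both normal completion and error, so the flag never goes stale.
      this.isStreaming = false;
    }
  }
}
```

The error still propagates to the caller; `finally` only guarantees the flag is cleared first.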
Mutating the in-flight `options` to flip `useVAD` off before the
final `finish()` call worked but left a footgun for anything later
that reads back `options.useVAD`. Build a local copy with the
override instead.
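In TypeScript the local copy is a one-line object spread. The sketch below assumes a hypothetical `StreamOptions` shape; only `useVAD` is taken from the commit message, and `language` is an invented field used to show that other options carry over.

```typescript
// Hypothetical options shape; only useVAD is named in the commit message.
interface StreamOptions {
  useVAD?: boolean;
  language?: string;
}

// Returns a copy with VAD disabled and leaves the caller's object untouched.
function withoutVad(options: Readonly<StreamOptions>): StreamOptions {
  return { ...options, useVAD: false };
}
```

Anything that later reads `options.useVAD` from the original object still sees the value the caller passed in.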
The function only reads from the span. Tagging it const signals
intent and matches the equivalent OnlineASR::insertAudioChunk
signature on the STT side.
`||` coerced an explicit `0` to the default 500. Switch to `??`
so callers can pass 0 to disable the margin.
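The difference between the two operators is easy to reproduce in isolation. In this sketch the function names are made up for the comparison, and 500 is the default quoted in the commit message.

```typescript
const DEFAULT_MARGIN = 500;

// `||` falls back on any falsy value, so an explicit 0 is lost.
function marginWithOr(margin?: number): number {
  return margin || DEFAULT_MARGIN;
}

// `??` falls back only on null or undefined, so 0 is preserved.
function marginWithNullish(margin?: number): number {
  return margin ?? DEFAULT_MARGIN;
}

console.log(marginWithOr(0));              // 500
console.log(marginWithNullish(0));         // 0
console.log(marginWithNullish(undefined)); // 500
```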
Explain what the 1.2 multiplier means — widens the VAD merge window
relative to the user-configured detectionMargin so brief
intra-utterance silences don't split a single utterance into
separate segments.
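To make the effect of the multiplier concrete, here is a small sketch. The 1.2 factor is from the commit message; the function name, the `<=` comparison, and the example values are assumptions for illustration only.

```typescript
// 1.2 is the multiplier named in the commit; the rest is hypothetical.
const MERGE_WINDOW_FACTOR = 1.2;

// True when a silence is short enough to be treated as part of one utterance.
function shouldMergeAcrossSilence(
  silence: number,
  detectionMargin: number
): boolean {
  return silence <= detectionMargin * MERGE_WINDOW_FACTOR;
}
```

With a `detectionMargin` of 500, a silence of 550 would exceed the margin itself but still fall inside the widened window, so the two speech segments around it are merged; a silence of 700 would split them.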
`OnlineASR::process` computes the silence-trim cut as a `size_t`
subtraction of these two constants. If either is tweaked such that
the ordering inverts, the subtraction wraps and the subsequent
`erase` reads past the buffer. Lock the invariant in at compile
time.

SpeechToText gained a 4th positional `vadSource` argument; pass an
empty string at all 9 existing call sites so the test still exercises
the no-VAD path. Add the new VAD sources to the CMake target so the
binary links.

mergeSegments: empty input, single-segment passthrough, distant
segments stay separate, close/adjacent segments merge, overlapping
shorter inner doesn't shrink the result, mixed sequence merges
only adjacent close pairs.

stream/streamInsert/streamStop: stream() loop exits promptly on
streamStop, streamInsert while streaming doesn't crash, concurrent
stream() throws StreamingInProgress, and stream can be restarted
after a stop.

Covers the new PR behavior:
- valid vadSource constructs without throwing
- invalid vadSource fails loudly
- one-shot transcribe() is unaffected when VAD is loaded
- stream(useVAD=true) on a model built without VAD throws
- stream(useVAD=true) over pure-silence audio drives the VAD branch
  of OnlineASR::process and exits cleanly via streamStop()

Also register fsmn-vad in run_tests.sh so the SpeechToTextTests
runner pushes the VAD model alongside the Whisper artifacts.

Required by the const-span signature of VoiceActivityDetection::
streamInsert. Without this, ModelHostObject's template instantiation
of synchronousHostFunction<&VoiceActivityDetection::streamInsert>
references an undefined symbol at link time on Android.

Mirrors the existing std::span<float> specialization; the underlying
getTypedArrayAsSpan<float>() helper returns a span over the same
storage, which converts implicitly to span<const float>.
@msluszniak msluszniak self-requested a review May 22, 2026 12:39

@msluszniak msluszniak left a comment

Member

🚀

@IgorSwat IgorSwat merged commit 44fb986 into main May 22, 2026
5 checks passed
@IgorSwat IgorSwat deleted the @is/vad-streaming branch May 22, 2026 13:01

Labels

feature PRs that implement a new feature

Development

Successfully merging this pull request may close these issues.

Implement continuous voice activity detection

4 participants